Result:
Central Napa and West Oakland have the highest PM2.5 concentrations. The eastern and central parts of the Bay Area are the next most polluted regions. The western North Bay, such as Santa Rosa, has lower PM2.5. The southern Bay Area, such as Gilroy, is also less polluted.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.500 8.262 8.564 8.480 8.748 10.522
# PM2.5 stacked plot (fill)
Result:
Compared with the proportions in the “Total” category, White residents make up a higher proportion of the population in tracts in the 10-11 PM2.5 tier, and especially in the 5-6, 6-7, and 7-8 tiers. Conversely, Asian residents make up a lower proportion in these tiers. Black or African American residents make up a higher proportion in the 9-10 and 10-11 PM2.5 tiers.
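The tier-by-race proportions behind a filled stacked plot can be sketched as follows. This is a minimal illustration only: the data frame `tracts`, its column names, and all counts are hypothetical, and the report's own code for this step is not shown.

```r
# Hypothetical tract-level data: PM2.5 and population counts by race.
tracts <- data.frame(
  pm25  = c(5.6, 6.4, 7.8, 8.3, 9.5, 10.2),
  white = c(3000, 2500, 2800, 1500, 1200, 2600),
  asian = c(500, 900, 700, 1800, 1400, 600),
  black = c(100, 200, 150, 300, 900, 800)
)

# Assign each tract to a one-unit PM2.5 tier: [5,6), [6,7), ..., [10,11).
tracts$tier <- cut(tracts$pm25, breaks = 5:11, right = FALSE)

# Sum population by tier, then convert each tier's row to proportions.
counts <- rowsum(tracts[, c("white", "asian", "black")], group = tracts$tier)
shares <- prop.table(as.matrix(counts), margin = 1)

# Region-wide ("Total") proportions, the baseline each tier is compared with.
total_share <- colSums(counts) / sum(counts)
```

Each row of `shares` sums to 1, so a tier's row can be read directly against `total_share` to see which groups are over- or under-represented.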
Results:
The east side of the Bay Area, between Oakland and the East Bay, has particularly high Asthma prevalence.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.93 25.84 39.98 52.15 64.33 243.29 1
#Combine PM2.5, Asthma with race tract data
# Asthma prevalence by race stacked plot (fill)
Result: Tracts with higher Asthma prevalence (the 100-150, 150-200, and 200-250 levels) have lower proportions of White and Asian residents, but higher proportions of Black or African American residents and of some other race groups.
## [1] 52.14577
## [1] 2453255
Result:
The scatter plot does not show a good fit: many points lie far above the best-fit line.
## [1] 2453255
## $par
## [1] 19.85653 -116.23251
##
## $value
## [1] 2217584
##
## $counts
## function gradient
## 117 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
## [1] 0.0006923995
##
## Call:
## lm(formula = Asthma ~ PM2.5, data = ces4_bay_pm25_Asthma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.47 -25.89 -9.61 12.94 182.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -116.278 13.040 -8.917 <2e-16 ***
## PM2.5 19.862 1.534 12.950 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.49 on 1578 degrees of freedom
## Multiple R-squared: 0.09606, Adjusted R-squared: 0.09549
## F-statistic: 167.7 on 1 and 1578 DF, p-value: < 2.2e-16
Result:
The linear regression analysis uses an optimization approach that minimizes the sum of squared residuals (SSR) and gives the best fit under the assumption of a linear model. The fitted regression equation is:
Asthma prevalence = -116.278 + 19.862 * PM2.5
An increase of 1 µg/m3 in the annual mean concentration of PM2.5 is associated with an increase of 19.862 in the age-adjusted rate of ED visits for asthma per 10,000. About 9.606% of the variation in the age-adjusted rate of ED visits for asthma per 10,000 is explained by the variation in the annual mean concentration of PM2.5.
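The idea of minimizing the SSR can be sketched in a few lines of R. This is an illustration on simulated data, not the report's own code: the vectors `pm25` and `asthma`, the helper `ssr`, and the starting values are all assumptions, and the real analysis uses `ces4_bay_pm25_Asthma`.

```r
# Simulated stand-in data (assumption); not the CalEnviroScreen tracts.
set.seed(1)
pm25   <- runif(200, min = 5.5, max = 10.5)
asthma <- -116 + 20 * pm25 + rnorm(200, sd = 37)

# Sum of squared residuals for a candidate (slope, intercept) pair.
ssr <- function(par, x, y) {
  sum((y - (par[1] * x + par[2]))^2)
}

# Numerical minimisation, starting from slope 0 and the mean as intercept.
fit_optim <- optim(par = c(0, mean(asthma)), fn = ssr, x = pm25, y = asthma)

# Closed-form least squares for comparison.
fit_lm <- lm(asthma ~ pm25)

fit_optim$par   # slope, intercept found by optim
coef(fit_lm)    # (Intercept), pm25 found by lm

# R-squared recovered from the residual and total sums of squares.
sst <- sum((asthma - mean(asthma))^2)
r2  <- 1 - sum(resid(fit_lm)^2) / sst
```

Because `lm` solves the least-squares problem exactly, the SSR reached by `optim` can approach but never fall below the SSR from `lm`, which is why the two sets of coefficients agree only approximately.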
## 1
## 42.6178
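As a rough arithmetic check, the printed prediction of 42.6178 is consistent with evaluating the fitted equation at a PM2.5 of 8 µg/m3. The input value is not shown in the output, so this is an inference rather than a documented fact:

$$
\widehat{\text{Asthma}} = -116.278 + 19.862 \times 8 \approx 42.62
$$

Similarly, assuming the value 2453255 printed before the fit is the total sum of squares about the mean (52.14577) and 2217584 is the minimized SSR, the reported R-squared follows directly:

$$
R^2 = 1 - \frac{\text{SSR}}{\text{SST}} = 1 - \frac{2217584}{2453255} \approx 0.0961
$$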
#Residual density Plot before log transformation
Result:
For the regression line to be a good fit, the residuals from the fitted regression line need to follow a normal distribution with a mean of 0. However, the residual density plot shows that the residual distribution is clearly right-skewed rather than normal.
## [1] 2453255
## $par
## [1] 0.3566878 0.6887687
##
## $value
## [1] 680.3463
##
## $counts
## function gradient
## 65 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
## [1] 0.0005442665
##
## Call:
## lm(formula = LN_Asthma ~ PM2.5, data = ces4_bay_pm25_LN_Asthma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.00402 -0.46479 0.03313 0.42298 1.75525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.69234 0.22840 3.031 0.00248 **
## PM2.5 0.35633 0.02686 13.264 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6566 on 1578 degrees of freedom
## Multiple R-squared: 0.1003, Adjusted R-squared: 0.09974
## F-statistic: 175.9 on 1 and 1578 DF, p-value: < 2.2e-16
Result:
Based on the residual scatter plot from the regression model after log transformation, the residuals are spread much more evenly above and below the zero line, which implies that under- and over-estimation occur about equally often. The log transformation is therefore worthwhile, and the regression line after log transformation is a better fit.
Result:
A positive residual means under-estimation and a negative residual means over-estimation. Hence, a large negative residual means that the Asthma prevalence for that particular census tract is substantially over-estimated. Comparing the residual density plots before and after log transformation, the residual distribution changes from right-skewed to much more symmetric. The residual density is now closer to a normal distribution with mean 0. This indicates that the regression model fits better after log transformation and supports log-transforming Asthma prevalence.
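The before-and-after comparison can also be put on a numeric footing with a sample skewness statistic. The sketch below uses simulated data in which the outcome is log-linear in PM2.5 (an assumption made for the demonstration). The names `pm25`, `asthma`, and `skewness` are hypothetical and not taken from the report.

```r
# Simulated data (assumption): outcome is log-linear in PM2.5 with normal noise.
set.seed(42)
pm25   <- runif(500, min = 5.5, max = 10.5)
asthma <- exp(0.7 + 0.35 * pm25 + rnorm(500, sd = 0.65))

# Fit on the raw scale and on the log scale.
fit_raw <- lm(asthma ~ pm25)
fit_log <- lm(log(asthma) ~ pm25)

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(e) {
  d <- e - mean(e)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(resid(fit_raw))  # clearly positive: right-skewed
skew_log <- skewness(resid(fit_log))  # near zero: roughly symmetric
```

A skewness near 0 is what a symmetric, approximately normal residual distribution would give, so a large drop from `skew_raw` to `skew_log` points the same way as the density plots.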
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00402 -0.46479 0.03313 0.00000 0.42298 1.75525
## Simple feature collection with 1 feature and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.1737 ymin: 37.41911 xmax: -122.1492 ymax: 37.44193
## Geodetic CRS: NAD83
## # A tibble: 1 × 6
## `Census Tract` PM2.5 geometry Asthma LN_Asthma residuals
## <dbl> <dbl> <MULTIPOLYGON [°]> <dbl> <dbl> <dbl>
## 1 6085513000 8.16 (((-122.1737 37.42636, -122.1… 4.93 1.60 -2.00
Result:
The lowest residual in the regression estimation is -2.00402 (the minimum in the residual summary above), which occurs in Census Tract 6085513000, located approximately at Stanford in Santa Clara County. The observed Asthma prevalence in this census tract is 4.93. The model substantially over-estimates Asthma prevalence in this area, which produces the lowest (most negative) residual.
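The size of this over-estimation can be worked through by hand from the printed coefficients and the tract's values. The figures below use the rounded PM2.5 of 8.16 shown in the table, so they match the reported residual only approximately:

$$
\begin{aligned}
\ln(4.93) &\approx 1.595 \\
\widehat{\ln(\text{Asthma})} &= 0.69234 + 0.35633 \times 8.16 \approx 3.600 \\
\text{residual} &\approx 1.595 - 3.600 = -2.005 \\
e^{3.600} &\approx 36.6
\end{aligned}
$$

On the original scale, the model therefore predicts roughly 36.6 ED visits per 10,000 for a tract where the observed rate is 4.93.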